Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora
Authors: Johannes Graën, Simon Clematide (Institute of Computational Linguistics, University of Zurich)
Abstract
The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly inter-connected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries in the format of generalised treebank query languages will be automatically translated into SQL queries.
Posted at the Zurich Open Repository and Archive (ZORA), University of Zurich. ZORA URL: https://doi.org/10.5167/uzh-111877 (published version).
Originally published at: Graën, Johannes; Clematide, Simon (2015). Challenges in the alignment, management and exploitation of large and richly annotated multi-parallel corpora. In: 3rd Workshop on the Challenges in the Management of Large Corpora, Lancaster, 20 July 2015, 15-20.
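To illustrate the relational storage and query translation described in the abstract, the following is a minimal sketch rather than the authors' implementation. The schema (`token`, `word_alignment`), the dictionary-based query format, and the function names `build_corpus`, `translate` and `find_aligned` are all assumptions made for this example; a real treebank query language would be considerably richer.

```python
import sqlite3

# Hypothetical schema: one row per token, one row per word-alignment link.
def build_corpus(conn):
    conn.executescript("""
        CREATE TABLE token (
            id INTEGER PRIMARY KEY,
            lang TEXT NOT NULL,
            sentence_id INTEGER NOT NULL,
            position INTEGER NOT NULL,
            form TEXT NOT NULL,
            lemma TEXT NOT NULL,
            pos TEXT NOT NULL
        );
        CREATE TABLE word_alignment (
            src_token INTEGER NOT NULL REFERENCES token(id),
            trg_token INTEGER NOT NULL REFERENCES token(id)
        );
    """)
    # Toy data: an English sentence and its German translation.
    tokens = [
        (1, "en", 1, 1, "The", "the", "DET"),
        (2, "en", 1, 2, "houses", "house", "NOUN"),
        (3, "de", 2, 1, "Die", "die", "DET"),
        (4, "de", 2, 2, "Häuser", "Haus", "NOUN"),
    ]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?, ?, ?)", tokens)
    conn.executemany("INSERT INTO word_alignment VALUES (?, ?)", [(1, 3), (2, 4)])

# Only these token attributes may be constrained (guards against SQL injection).
ALLOWED = {"lang", "form", "lemma", "pos"}

def translate(src, trg):
    """Turn two attribute-constraint dicts into one parameterised SQL query."""
    conditions, params = [], []
    for alias, constraints in (("s", src), ("t", trg)):
        for attr, value in sorted(constraints.items()):
            if attr not in ALLOWED:
                raise ValueError(f"unknown attribute: {attr}")
            conditions.append(f"{alias}.{attr} = ?")
            params.append(value)
    sql = (
        "SELECT s.form, t.form FROM word_alignment a "
        "JOIN token s ON s.id = a.src_token "
        "JOIN token t ON t.id = a.trg_token"
    )
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params

def find_aligned(conn, src, trg):
    """Return (source form, target form) pairs matching both constraint sets."""
    sql, params = translate(src, trg)
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
build_corpus(conn)
# English tokens with lemma "house" and their aligned German tokens.
print(find_aligned(conn, {"lang": "en", "lemma": "house"}, {"lang": "de"}))
```

Keeping the constraint values as bound parameters and restricting attribute names to a fixed set means the generated SQL stays safe even for ad hoc queries typed by users.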
Similar resources
Slate - A Tool for Creating and Maintaining Annotated Corpora
Recent research trends of the last five years show that richly annotated corpora inspire novel research. These richly annotated corpora are indispensable for progressing research, but also more difficult to manage and maintain due to increasing complexity – what is needed is a way to manage the annotation project in its entirety. However, annotation project management has received little attent...
Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus
In this article we illustrate and evaluate an approach to create high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. The trans...
Evaluating Cross-Language Annotation Transfer in the MultiSemCor Corpus
In this paper we illustrate and evaluate an approach to the creation of high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. Th...
Parallel Aligned Treebank Corpora at LDC: Methodology, Annotation and Integration
The interest in syntactically-annotated data for improving machine translation quality has spurred the growing demand for parallel aligned treebank data. To meet this demand, the Linguistic Data Consortium (LDC) has created large volume, multi-lingual and multi-level aligned treebank corpora by aligning and integrating existing treebank annotation resources. Such corpora are more useful when th...
Translation as Annotation
In this paper we illustrate an approach to the creation of high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the key notion that translating a text can be seen as a linguistic annotation task which is easier than manual annotation with formal schemes. After translation, formal annotations can be automatically derived...